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Abstract . We present a comparative study of the application of modern eigenvalue algorithms 
to an eigenvalue problem arising in quantum physics, namely, the computation of a few interior 
eigenvalues and their associated eigenvectors for the large, sparse, real, symmetric, and indefinite 
matrices of the Anderson model of localization. We compare the Lanczos algorithm in the 1987 
implementation of CuUum and Willoughby with the implicitly restarted Arnoldi method coupled 
with polynomial and several shift-and-invert convergence accelerators as well as with a sparse hy- 
brid tridiagonalization method. We demonstrate that for our problem the Lanczos implementation 
is faster and more memory efficient than the other approaches. This seemingly innocuous problem 
presents a major challenge for all modern eigenvalue algorithms. 
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1. Introduction. In this paper we present a comparative study of the apphca- 
tion of modern eigenvalue algorithms to an eigenvalue problem arising in quantum 
physics. The task is to compute a few (5-10) interior eigenvalues and the associ- 
ated eigenvectors of a family of structured large, sparse, real, symmetric, indefinite 
matrices. The off-diagonal elements are equal to the off-diagonal elements of the 
7-point central difference approximation to the three-dimensional Poisson equation 
on the unit cube with periodic boundary conditions. The matrices differ from each 
other only in the diagonal entries, which are suitably chosen random numbers. 

Previously this problem was often solved by using the 1987 CuUum and Wil- 
loughby implementation of the Lanczos algorithm ||, ^, in the following called 
CWI. But in the last 10 years several new eigenvalue methods have been developed 
and implemented as software packages, that seem, at least at first glance, more 
appropriate than CWI, see, e.g., the recent survey and comparison given in p9| . 
We apply these new codes to the described family of matrices and check whether 
they are faster and more memory efficient than CWI. To our surprise, none of the 
tested codes is consistently better than CWI. As we show below, we find only a 
single new code which is at least as fast as CWI. But this code needs two orders of 
magnitude more memory than CWI. We therefore believe that the described family 
of matrices will present an important new benchmark example and will hopefully 
lead to modifications and improvements for the current methods. 

The paper is organized as follows. In §^ we describe the underlying quantum 
physics problem, i.e., the Anderson model of localization, and introduce the pa- 
rameters used in our study. In §|^ we briefly review the CuUum/ Willoughby version 
of the Lanczos method that has been previously used in the simulations for this 
model. We then give in §^ a brief survey of more recent eigenvalue methods. In 
§ H we present comparative results for the different methods and show that CWI is 
faster and needs less memory than all other approaches. 

2. The Anderson model of localization. The Anderson model of local- 
ization iQ] is a convenient model for the investigation of electronic properties of 
disordered systems. Although it represents a severe simplification of amorphous 
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materials and alloys, it has nevertheless become a paradigmatic model and is cur- 
rently widely used in the theoretical description of quantum mechanical effects of 
disorder such as, e.g., spatial localization of electronic wave functions with increasing 
strength of disorder and the corresponding metal-insulator transitions |2|, 
The quantum mechanical problem is represented by a Hamilton operator in the 
form of a real symmetric matrix A and the quantum mechanical wave functions 
are simply the eigenvectors of A, i.e., finite vectors x with real entries. E.g., for a 
simple cubic lattice with M — N x N x N sites, we have to solve the eigenvalue 
equation Ax — \x, which is given in site representation as 

(2.1) Xi-ij,k + Xi+ij,k + Xij-i,k + Xij+i^k + Xij^k-i + Xij^k+i + £i,j,kXij,k 



with i,j,k denoting the cartesian coordinates of a site. The off-diagonal entries 
of A correspond to hopping probabilities of the electrons fro m on e site to a neigh- 
boring site. For simplicity, we have set them all to unity in {2A). The disorder is 
encoded in the random potential site energies Sij^k on the diagonal of the matrix 
A. We consider only the case of £i.j,fc being uniformly distributed in the interval 
[—w/2, +w/2]. This is a common simplification, usually used in the studies of the 
Anderson model of localization with typical values of w ranging from 1 to 30. The 
boundary conditions are usually taken to be periodic, but hard wall and helical ||2^ 
boundary conditions are sometimes also used. According to the Gersgorin circle the- 
orem [T^ every such matrix A has eigenvalues in the interval [~w/2 — 6, +w/2 + 6]. 
Possible generalizations of the Anderson model include anisotropic or even ran- 
dom hopping 1^] and various choices of the distribution function of the site energies 
[Q. However, the graph of the matrix remains the same. 

Although the above matrix seems to be fairly simple, the intrinsic physics is 
surprisingly rich. For small disorder (w <C 16.5), the eigenvectors are extended, 
i.e., Xij^k is fluctuating from site to site but the envelope |a;| is approximately a 
non-zero constant. For large disorder [w ^ 16.5), all eigenvectors are localized, i.e., 
the envelope of the nth eigenstate may be approximately written as exp[— |r — 
^n\/ln{w)] with r = {i,j,k)'^ and ln{w) denoting the localization length of the 
eigenstate at the specified strength w of the disorder. In Fig. 2T, we show examples 
of such states for the Anderson model in one spatial dimension. Since extended 
states can contribute to electron transport, whereas localized states cannot, the 
Anderson model thus describes a metal-to-insulator transition: In three-dimensional 
samples at w = Wc ~ 16.5, the extended states at A « vanish and no current can 
flow. The eigenvector properties are also connected with the statistical properties 
of the spectrum cr(A) of A. In the extended regime one finds level repulsion, while 
in the localized regime the eigenvalues are uncorrelated resulting in level clustering. 
These results agree quantitatively with random matrix theory p8| . Directly at Wc 
there is a so-called critical regime where the eigenvectors are multifractal entities 
pl| , pof showing characteristic fluctuations of the amplitude on all length scales. 
In order to numerically distinguish these three regimes, namely localized, critical 
and extended behavior, one needs to (i) go to extremely large system sizes and (ii) 
average over many different realizations of the disorder, i.e., compute eigenvalues 
or -vectors for many matrices with different diagonals. 

In the present paper we concentrate on the computation of a few eigenvalues 
and corresponding eigenvectors for the physically most interesting case of critical 
disorder Wc and in the center of (j{A), i.e., at A = 0, for system sizes as large as 
possible. In Fig. 2.2, we show a histogram of (j{A) for different disorders. Note the 
high density of states at A = in all cases. Therefore we have the further numerical 
challenge of distinguishing clearly the eigenstates in this high density region. 
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Fig. 2.1. Extended (dashed line) and localized (thick solid line) eigenstate for a single 
realization of the Anderson model in one spatial dimension with N = 200 sites and periodic 
boundary conditions. For the localized eigenstate, we also show the exponential envelope with 
localization length I ^ 12 ( thin lines ) according to § 
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Fig. 2.2. Histogram n(A) of eigenvalues for a single system with = 48'^ sites and w = 10 
(n), 16.5 (o), and 20 (<>). The bin width is 0.05. The lines are obtained by consecutively averaging 
20 bins. 
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3. The Lanczos algorithm and the Cullum/Willoughby implementa- 
tion. As outlined in the last section, each of the matrices A is sparse, symmetric and 
indefinite. Furthermore, the matrix-vector multiplication Ax can be written explic- 
itly as in ( ^.l[ ) and is thus easily implemented. An ideal candidate for an algorithm 
taking advantage of nearly all these properties is the Lanczos algorithm . This 
algorithm iteratively generates a sequence of orthogonal vectors vt, i = 1,...,K, 
such that Vj^AVk = Tk, with V = {vi,V2, ■ ■ ■ ,vk} and Tk a symmetric tridiagonal 
K X K matrix. One obtains the recursion 



(3.1) f3i+iVi+i = Avi - a^Vi - (3iVi-i, 

where ai = vj Avi and (ii+i — Vi^iAvi are the diagonal and subdiagonal entries of 
Tri = Q and Vi is an arbitrary starting vector. For K — M in exact arithmetic 
this is an orthogonal transformation to tridiagonal form that needs M matrix- vector 
multiplications. The eigenvalues of the tridiagonal matrix Tk, also known as Ritz 
values, are then simply the eigenvalues of the matrix A and the associated Ritz 
vectors yield the eigenvectors [|6| [z], p2|, p5|. 

In finite precision arithmetic, however, the Lanczos vectors Vi loose their or- 
thogonality after a small number of Lanczos iterations. Consequently, there appear 
so called "spurious" or "ghost" eigenvalues in a{TK), which do not belong to (j{A). 

There are several solutions to this problem: total reorthogonalization of all 
Lanczos vectors against each other, selective reorthogonalization [Q, or distin- 
guishing between good and spurious eigenvalues. While the reorthogonalization 
leads to an increase in memory requirements and computing time, since all or sev- 
eral of the Vi need to be stored and reorthogonalized, the solution implemented 
in CWI [Q uses a simple and highly successful procedure to identify the spurious 
eigenvalues, thereby avoiding reorthogonalization. It thus only uses two Lanczos 
vectors in each iteration step and consequently the memory requirements are very 
small. An eigenvalue of Tk is identified as being spurious if it is also an eigenvalue 
of the matrix T^ which is constructed by deleting the first row and column of Tk- 
Still, the good eigenvalues produced may not yet have converged properly for a 
given K. So we further use the fact that good eigenvalues will be replicated in 
(t{Tk) if K is large enough. We only accept eigenvalues as being good eigenvalues 
after they have been replicated at least once in (t(Tx). Hence we usually need at 
least K > 2M. Finally, in order to obtain the eigenvectors corresponding to these 
good eigenvalues of A, all Lanczos vectors must be computed a second time, again 
doubling the computational effort. 

The convergence of the Lanczos algorithm is very fast for the eigenvalues close 
to min a{A) and max (t{A). This is especially true if these eigenvalues are well 
separated. However, for eigenvalues in the interior of ct{A) and for eigenvalues which 
are not well separated the convergence is slow. Furthermore, the tridiagonal matrix 
Tk becomes very large for an iteration in the interior of o-{A). Nevertheless, the CWI 
has been used to study the Anderson model of localization even at A = successfully 
for years |2^, |lj, ^ and eigenstates for matrices with N — 100 can be 

obtained within a few weeks of computing time [ p4[ . 

We also remark that most of the computational effort in the Lanczos algorithm 



is spent on the iteration of (3.1), i.e., on matrix- vector multiplications and vector 



additions. These can be easily parallelized and thus the CWI is well suited for 



parallel architectures. For example, the eigenspectra presented in Fig. 2.2 have 
been obtained by such a parallel version of CWI running for about 60 hours for 
each realization using 16 processors of a Parsytec GCC Power Plus. 

4. Modern approaches. Lately there has been much progress in eigenvalue 
methods mostly concentrating on non-symmetric matrices. We refer to [ p^ for 
a recent survey. The symmetric problem is usually assumed to be taken care of 
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implicitly. But although our symmetric eigenvalue problem is well-conditioned |13| , 
the fact that the eigenvalues are clustered in the neighborhood of A = 0, our region 
of interest, creates difficulties for all numerical methods. Promising choices for 
possible replacements of the CuUum and Willoughby approach are the implicitly 
restarted Arnoldi method and the hybrid tridiagonalization (HTD) algorithm 
of Cavers Another new approach is the Jacobi-Davidson method In the 
following we will pay special attention to the Arnoldi approach, since it allows the 
easy use of the shift-and-invert technique. We expect this to overcome the above 
mentioned clustering problem at A = 0. 

4.1. Modifying the eigenproblem. The problem of slow convergence in the 
interior of (t{A) can be overcome by computing eigenvalues and eigenvectors for a 
modified eigenvalue problem f{A)x — f{X)x. The function / is chosen such that the 
desired point A in cr is mapped onto or close to the minimum or maximum of cr(/). 
Furthermore, one should choose /, such that cr(/) has well separated eigenvalues at 
min (t(/) and max cr(/). 

Among the many possible choices for f(A), we shall consider in the following: 
(i) polynomial convergence accelerators, where f{A) is chosen as a polynomial which 
has its maximum (or minimum) at A. This moves A to max a{f{A)) (or min a{f{A)) 
resp.). We remark that occasionally these convergence accelerators are somewhat 
misleadingly called preconditioners. (ii) shift-and-invert with f{A) = [A — 
and / the M x M identity matrix. This choice of / requires the additional solution 
of a linear system with A — XI in each step of the eigenvalue iteration ^ . For 
the solution of this linear system there are again two alternatives: (ii.a) direct sparse 
solvers for A — XI. Unfortunately, not many direct solvers exist which can make 
efficient use of the sparseness for indefinite problems, (ii.b) iterative solvers using 
only matrix-vector multiplications with A — XI. The iterative methods promise to 
make large matrix sizes possible, since they benefit in an optimal way from sparsity. 
Memory requirements and computational cost of a matrix-vector multiplication are 
proportional to the number of non-zeros and therefore proportional to M for our 
present problem. However, we note that in our case {A — XI) is indefinite which 
will lead to slow convergence for most iterative solvers. Since the convergence of 
the iterative solver is dominated by the condition number of the linear system, one 
may employ preconditioners to accelerate its convergence jl^ . 

All these approaches result in a competition between smaller numbers of Lanc- 
zos or Arnoldi iterations and increased costs for each such iteration step. For this 
reason it is not a priori clear whether they will indeed give a net reduction in 
computation time. 

4.2. The implicitly restarted Arnoldi method. In a recent comparison 



|19(| of different Arnoldi based packages ArPack |20| was found to be the fastest 
and most reliable of the compared codes. 

When applied to symmetric eigenvalue problems, the main difference between 
the techniques in ArPack based on the Arnoldi iteration and the Lanczos iteration 
is an implicit restart technique. The Arnoldi method stores a number of Ritz 
vectors produced by the iterations and after a small number of steps initiates a 
restart which uses an implicit QR-algorithm for the small eigenvalue problem to 
create a new starting vector and to maintain orthogonality among the Ritz vectors. 
In contrast to the Lanczos algorithm more vectors have to be stored but spurious 
eigenvalues are avoided. 

The ArPack implementation further allows the easy use of additional acceler- 
ation methods such as polynomial convergence acceleration and shift-and-invert as 
outlined above. So ArPack is probably the best choice for a replacement of CWI. 
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4.3. Other approaches. We have also studied the use of the HTD method Q 
and the J acobi/ Davidson method The HTD method is a direct tridiagonahza- 
tion method specificaUy designed for sparse matrices. This makes it an interesting 
approach for our purposes. However, it is not exphcitly designed to compute only 
few interior eigenvalues and associated eigenvectors. 

The Jacobi/Davidson method appears to be another promising future direc- 
tion if it can be properly accelerated. The current version is designed for complex 
unsymmetric problems and to report comparative results would not be fair to this 
interesting new development. We intend to further study this method in the future. 

5. Results. After a short discussion of the specific implementations and pa- 
rameters, we now present the results of our comparison. The tests are performed 
on Hewlett-Packard HP9000 735/125 workstations for < 24^ and on a HP9000 
K460 with the fast PA8000 processor for > 24^. The latter machine allows 
us to use up to 1.9 GB RAM and is about 3.5 times faster. In order to obtain 
a fair comparison we always require that the eigenresidual of the computed eigen- 
value/eigenvector pair satisfies \Ax — Xx\ < 10~*. The CPU times have been mea- 
sured using the Unix time command of the tcsh shell. Even for the largest system 
sizes considered, we have usually taken at least 5 different realizations of disorder 
and averaged the resulting CPU times. Sometimes, when the CPU times for a given 
algorithm fluctuate widely, we report the range of times instead of a simple average. 
We remark that the use of time introduces a further uncertainty into the results 
such that we always have an error of about 10%. The random number generators 
used are rcLn2 from |Q and the rand command from Matlab. 

5.1. The standard approach. For our particular problem we can reach M — 
SO'' — 512000 with CPU times of about two weeks on the K460 machine using 
CWI. However, keeping in mind the configurational averaging necessitated by the 
underlying physical problem, a reasonable upper limit for the matrix size is M = 
50^ = 125000. 



In Table 5.1 we show the CPU times obtained for CWI in the center and at 
the edge of <j{A). Note that the computing times are nearly independent of the 
disorder parameter W, but, as expected, CWI is much faster at the edges of <t{A) 



than at A = 0. In Table |5.2| , we show the results for CWI at A = in dependence 
on the matrix size M. 

In the ArPack implementation of Lehoucq et al. ||2^ one has to set the pa- 
rameter NCV which is the largest number of basis vectors that will be used in the 
implicitly restarted Arnoldi process. In normal mode and in the interior of a{A) we 
find that the actual value of NCV heavily influences computing time as shown in 
Table |5.3|. This dependence on NCV becomes less pronounced for eigenvalues close 



to min a{A) and max a{A) as shown in Table 5.4. However, since we do not know 



of any strategy to choose NCV optimally, this is a severe restriction of ArPack. 



Furthermore, as shown in Table 5.1, ArPack, working in normal mode, is much 



slower than CWI both in the center and at the edge of o-{A). It becomes too slow 



for practical use already at M — 12"^ — 1728 as shown in Table 5.2 



In Tables p.l\ and 5.2 we also include CPU times for the HTD method. Note 
that we only show the CPU times needed to transform A to tridiagonal form. 
Nevertheless, we find that HTD is much slower than CWI. We remark that when 



we use CWI to compute the full spectrum as in Fig. 2.2, it is still faster than HTD 
except for small system sizes M < 12'^ ~ 1728. 

5.2. Polynomial convergence acceleration. As outlined above, polyno- 
mial convergence acceleration is usually a convenient choice to speed up the com- 
putation of eigenvalues and -vectors corresponding to a small region of cf{A). Here, 
we test a polynomial provided by D. Sorensen, one of the authors of ArPack, and 
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M 


W 


CWI 


ArPack 


HTD 


1000 


10.0 


2.4 


240 - 2200 


22 


1000 


16.5 


2.5 


230 - 1400 


22 


1000 


20.0 


2.4 


140 - 1300 


23 


1728 


10.0 


7.9 


1700 - 12000 


170 


1728 


16.5 


7.8 


410 - 12000 


170 


1728 


20.0 


7.6 


1100 - 20000 


160 


4096 


10.0 


43 




2600 


4096 


16.5 


40 




2500 


4096 


20.0 


40 




2500 


1000 


10.0 


0.71 


0.78 


22 


1000 


16.5 


0.77 


0.85 


22 


1000 


20.0 


0.80 


0.89 


23 


1728 


10.0 


0.94 


1.5 


170 


1728 


16.5 


1.0 


1.7 


170 


1728 


20.0 


1.1 


1.8 


160 


13824 


10.0 


9.4 


57 




13824 


16.5 


9.2 


57 




13824 


20.0 


9.3 


71 





Table 5.1 

CPU times in seconds to compute 5 eigenvectors with CWI, ArPack, t 
part of the table corresponds to eigenvalues in the interior of <t{A) at A 
corresponds to the 5 largest eigenvalues. 



i HTD. The upper 
0, the lower part 



M 


CWI 


CWI+ 
conv. acc. 


ArPack 


ArPack+ 
conv. acc. 


HTD 


1000 


2.5 


5.6 


230 - 1400 


4.9 


22 


1728 


7.8 


11 


410 -12000 


10 


170 


4096 


40 


59 




66 


2500 


13824 


770 


1700 




1700 




13824 
27000 
91125 
110592 


220 
1000 
20000 
35000 






550 
3500 





Table 5.2 

CPU times in seconds to compute at w = 16.5 the eigenvectors corresponding to the 5 
eigenvalues closest to X = with CWI, CWI with Chebyshev-polynomial acceleration, ArPack 
in normal mode, ArPack with Chebyshev-polynomial acceleration and HTD for various matrix 
sizes M . The CPU times in the upper (lower) part of the table have been measured on the HP 
735 (HP K460). 



C. Sun |31 
recursion: 



It is based on a Chebyshev-type polynomial given by the following 



(5.1) 



Piix) 

P2{x) 
Pn+l{x) 



1 

a 



= 2{a + hx^)pn 



where a — {x\-\-x\) / {x\—x\) and b = 2/ {x\—x\). Also, Pn is symmetric with a local 
maximum Pn(0) > 1 at zero, |p„(a;)| < 1 in the interv als \ xi, X2\ and [— 2:2,— xi], 
and Pn grows rapidly for |a;| > X2 as shown, e.g., in Fig. 5.1 for n = 20. In general, 
one would like to have xi and X2 chosen automatically in order to obtain a suitable 
function / as described in In all our present calculations we use n = 50 with 
{xiY = 0.005 and {X2Y = 1-1 [max (t{A)Y according to 0. This polynomial 
convergence acceleration speeds up ArPack immensely as one can see in Table 5.2. 
Since A = is now mapped to max a[p^o{A)), the actual value of NCV is less 
important. We found NCV = 50 to be a good choice to make the execution times 
faster, although this requires more memory. Still CPU times are about a factor of 
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90 


NOV 

110 


130 


1 


2577 


4337 


9455 


2 


2011 


2267 


3455 


3 


4190 


1935 


3635 


4 


411 


755 


1204 


5 


2435 


1811 


11620 


6 


8188 


4704 


2506 



Table 5.3 

CPU times in seconds to compute 5 eigenvectors with ArPack for 6 different diagonals and 
3 choices of NCV for M = 1728 and w = 16.5 at\ = 0. 









NCV 








w = 


= 10 


VJ = 


16.5 


w = 


20 




15 


20 


15 


20 


15 


20 


1 


41 


44 


63 


62 


56 


58 


2 


65 


65 


43 


47 


49 


55 


3 


57 


60 


54 


64 


44 


47 


4 


62 


62 


44 


49 


49 


55 


5 


72 


76 


72 


68 


76 


71 


6 


44 


45 


56 


62 


87 


85 


7 


81 


70 


72 


88 


108 


90 


8 


52 


60 


49 


51 


70 


66 


9 


42 


48 


58 


68 


99 


75 



Table 5.4 

CPU times in seconds to compute the eigenvectors corresponding to the 5 largest eigenvalues 
of A with ArPack for 9 different diagonals and 2 choices of NCV with M = 13824. 
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Fig. 5.1. Chebyshev polynomial p2o{x) with xi = 1 and X2 = 10. 
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M 


CWI 


ArPack+ 


HTD 






ArPack+ 




conv. acc. 


LU 


BKP 


SuperLU 


MA27 


MA27+HB 


1000 


0.24 


0.6 


1.4 


7.4 


10 


7.8 


0.8 


0.4 


1728 


0.43 


1.2 


6.7 


18 


22 


19 


2.2 


0.9 


4096 


1.0 


2.3 


13 


80 


81 


92 


8.6 


2.9 


13824 


3.4 


6.6 










70 


22 


27000 


6.5 


15 










200 


68 


91125 


22 












1300 


500 


110592 


27 












2600 


600 



Table 5.5 

Memory requirements in MB to compute at w = 16.5 the eigenvectors corresponding to the 
5 eigenvalues closest to X = for the different diagonalizers (Names as in the text, HB indicates 
hard wall boundary conditions) 



M 


CWI 






ArPack+ 




LU 


BKP 


SupcrLU 


MA27 


MA27+HB 


1000 


2.5 


39 


74 


8.8 


1.3 


0.88 


1728 


7.8 


150 


300 


28 


5.0 


2.0 


4096 


40 


1200 


1900 


220 


39 


9.8 


13824 


770 








740 


140 


13824 


220 








260 


58 


27000 


1000 








1300 


250 


91125 


20000 








19000 


4900 



Table 5.6 

CPU times in seconds to compute at w = 16.5 the eigenvectors corresponding to the 5 
eigenvalues closest to X = with shift- and-invert ArPack and different direct solvers. For easier 
comparison, we also include CWI. The CPU times in the upper (lower) part of the table have 
been measured on the HP 735 (HP K460). 



two larger than for CWI without any convergence acceleration. Unfortunately, CWI 



itself is not made faster by the use of this accelerator as also shown in Table 5.2. 
Although the number of Lanczos vectors needed to achieve convergence is reduced 
remarkably, the additional computational effort now required for every Lanczos 
step becomes very large. At the end one needs even slightly more matrix-vector 
multiplications than without convergence acceleration. 

5.3. Shift-and-invert with direct solvers. We now discuss the use of the 

shift-and-invcrt mode of ArPack together with a direct solver for the linear system 
{A — XI)y = b. We first note that although our matrix A is symmetric, it is 
not positive definite and thus we cannot use a sparse Cholesky decomposition. 
Unfortunately, there are only few packages available for sparse symmetric indefinite 
problems Therefore we also investigated several packages for general sparse 
matrices. 

Meschach is a freely available mathematical package written in C. There 
are three sparse factorization methods implemented in Meschach: Cholesky, LU, 
and Bunch-Kaufmann-Parlett (BKP). Cholesky factorization does not work due to 
the indefiniteness of A. LU and BKP are supposed to take advantage of the sparse- 
ness of our problem. However, we find that they have huge memory requirements 



of the order of as shown in Table 5.5. So they are inapplicable for large system 



sizes. And even for small systems they turn out to be much too slow as shown in 



Table |5^. 

The Harwell Subroutine Library |]l6| contains the sparse symmetric indefinite 



solver MA27. As shown in Table |5^, ArPack with MA27 is about as fast as 
CWI. In fact, MA27 seems to become faster than CWI for M > 45^ = 91125. 
Unfortunately we could not test this because of the huge memory requirements of 



MA27 as shown in Table 5.5 
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M 


CWI 


ArPack+ 




QMRX 


QMRL 


CPX 


1000 


2.5 


68 


93 


85 


1728 


7.8 


239 


320 


330 


4096 


40 


1600 




2300 


13824 


770 


35000 






13824 


220 


12000 






27000 


1000 








91125 


20000 









Table 5.7 

CPU times in seconds to compute at w = 16.5 the eigenvectors corresponding to the eigen- 
values closest to \ = with shift- and-invert ArPack and the iterative solvers from QMRPack. 
For easier comparison, we also include CWI. The CPU times in the upper (lower) part of the 
table have been measured on the HP 735 (HP K46O). 



A noteworthy fact is that MA27 is much better for hard wall boundary condi- 
tions (HB). This can be explained by the fact that the bandwidth of the matrix is 
©(A^^) instead of 0{N^) as for periodic boundary conditions. However, on physical 
grounds, a calculation with HB is expected to be much more influenced by the finite 
size of the cubes considered. So although we can obtain larger system sizes here, the 
results for the interesting physical quantities may not be as reliable. Nevertheless, 
ArPack with MA27 for matrices with HB is faster than CWI for matrices with 
periodic boundary conditions. MA27 with HB is also faster than CWI with HB, 
since for CWI there is only a negligible difference in computing time between HB 
and periodic boundary conditions coming from the matrix-vector multiplication. 

SuperLU is a package by Demmel et al. [|| doing a sparse LU decomposition. 
Compared with CWI and MA27, SuperLU is much slower as shown in Table 
Furthermore, it needs about one order of magnitude more memory than MA27 as 



shown in Table 5.5, SuperLU allows the input of different preorderings in addition 
to the default minimum-degree ordering. We have tested a symmetric minimum 
degree ordering from the Matlab program and a nested dissection ordering com- 
puted by the Chaco package For some choices of diagonals we derive small 
savings in run time and/or memory but these are not consistent, i.e., the same kind 
of ordering speeds up the program for one choice of A^ and slows it down for A^ + 1 . 

5.4. Shift-and-invert with iterative solvers. Considering the recent ad- 
vances in iterative solvers, we initially hoped that ArPack in shift-and-invert mode 
coupled with a modern iterative method for the solution of linear systems would be 
quite efficient. As we will show below, this is not the case. 

The quasi-minimal-residual (QMR) technique should be one of the best iterative 
solvers for symmetric matrices that works using only matrix- vector multiplications 
if no preconditioning is used JT^ . However, as shown in Fig. 2.2, our matrices 
are indefinite with a nearly symmetric eigenvalue distribution around zero. This 
results in a very bad iteration count of about 2M for the solution of a single linear 
system of size M . The times and iteration numbers from three variants implemented 
in QMRPack |i^, namely, QMR based on three-term Lanczos with and without 
look-ahead (QMRL/QMRX) and QMR based on coupled two-term Lanczos without 



look-ahead (CPX) are not very different as shown in Table 5.7. For all three methods 
the iteration count is rather high. Consequently, we find that ArPack in shift-and- 
invert mode coupled with QMRPACK as iterative solver is about 20 times slower 
than CWI. The ArPack input parameter NCV was set to 15. We also find that 
the implementation of QMR based on coupled two-term Lanczos with look-ahead 
does not converge for larger systems within 50000 iterations. 

In order to check if other iterative methods are perhaps more efficient than 
QMR for our family of matrices, we have also tried several such iterative solvers 
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M 


QMR 


CGS 


Bi-CG 


Bi-CGSTAB 


512 


796 - 999 


843 - n.c. 


809 - 980 


1006 - n.c. 


1000 


1705 - n.c. 


1762 - n.c. 


1701 - n.c. 


n. c. 


1728 


2918 - n.c. 


n.c. 


2932 - n.c. 


n. c. 


2744 


4809 - n.c. 


n.c. 


4775 - n.c. 


n. c. 


4096 


7270 - n.c. 


n.c. 


7401 - n.c. 


n. c. 



M 


GMRES 


GMRES(5Ar) 


QMR+jac 


QMR+tri 


512 


511- 512 


n.c. 


813 - n.c. 


295 - 569 


1000 


995-1000 


n.c. 


n.c. 


1389 - n.c. 


1728 


1723-1728 


n.c. 


n.c. 


n. c. 


2744 


2736-2744 


n.c. 


n.c. 


n. c. 


4096 




n.c. 


n.c. 


n. c. 



Table 5.8 

Number of iterations needed in Matlab in order to solve the linear system Ay = b. The 
abbreviations for the different algorithms are explained in the text. The runs are aborted uihen 
the number of iterations is more than 2M . This case of no convergence is indicated by "n.c". 



using the Matlab programming environment. In addition to QMR, we have consid- 
ered the conjugate-gradient-squared method (CGS) |Q, the BiConjugate-Gradient 
method (BiCG) 0, its stabihzed variant (Bi-CGSTAB) Q and the generahzed- 
minimal-residual (GMRES(k)) method |2^. Furthermore, several general purpose 
preconditioners ||^, i.e., the Jacobi (jac) preconditioner, the ILU(O) preconditioner 
and also the three main diagonals as the preconditioning matrix (tri) have been 
tested. Since the performance of Matlab programs cannot directly be compared 
to compiled programs, we only give the iteration count of each algorithm. One such 
iteration requires at least one matrix-vector multiplication and two inner products 
and is thus at least as expensive as one Lanczos step. We always use the built-in 
implementations of these algorithms as in Matlab v5.1. 

In Table 5^ , we show results obtained for various matrix sizes M. The ranges 
reflect the variations corresponding to 12 different realizations of disorder on the 
diagonal of the matrices. Note that for the same AI we use the same 12 diagonals 
for all algorithms. We always choose xq = as initial vector. The iteration count 
represents the number of iterations needed to solve the matrix equation Ax = y 
up to a relative accuracy of 10~®. We always stop the algorithms if after 2M 
iterations this accuracy has not been achieved. For practical restart values k < 200, 
GMRES(fc) does not converge at all within our iteration limit. With no restarts, 
GMRES needed M or slightly less iterations. But note that both memory and 
computing-time requirements for M steps of pure GMRES exceed those of a non- 
sparse direct solver. 

None of the tested preconditioners is consistently effective. The Jacobi precon- 
ditioner in fact increases the iteration count most of the time. The ILU(O) pre- 
conditioner returns a singular matrix and consequently appears inapplicable. The 
tridiagonal preconditioner is more effective, in some cases reducing the iteration 
count by up to 50%. But again there are examples where it fails to do anything. 
We remark that in general the iteration count is consistent with the results from 
QMRPack. To sum up, we find that all of these iterative algorithms do not perform 
better that the QMR algorithm and consequently are no real alternative. 

Another idea is to work with the matrix instead of A. Since it is symmetric 
and positive definite, we can now use the conjugate gradient method. But this 
squares the condition number of the linear system, which is already usually very 
large for A ||l^. Hence more effort has to be invested into the development of a 
good preconditioner. We find for our matrices that while the iteration count is in 
general a bit less than for the methods mentioned above and the preconditioners 
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are more consistently effective we still need of the order of M steps with at least two 
matrix-vector multiplications for the solution of one linear system. And since the 
shift-and-invert ArPack still needs to solve several linear systems, all the iterative 
methods working on were no match for CWl. 

6. Summary. We have tested several modern methods to compute a few inner 
eigenvectors of a very large sparse matrix corresponding to the Anderson model of lo- 
calization motivated within theoretical physics. Particularly the implicitly restarted 
Arnoldi method in connection with polynomial convergence acceleration and in 
shift-and-invert mode with several direct and iterative solvers for systems of linear 
equations is compared to the CuUum/Willoughby implementation of the Lanczos 
method. Despite the recent progress in linear system solvers we find all considered 
modern methods to be inapplicable for very large system sizes, because either the 
computation times or the memory requirements are much to large. To sum up, we 
find that CWI Lanczos is currently still the most efficient method for the matrix 
type we are interested in. We emphasize that the CWI Lanczos, with our slight 
modifications as outlined in §|^, is a reliable tool for our problem. In particular, the 
problem of spurious eigenvalues which plague the original Lanczos algorithm, can 
be handled safely. 

Since large scale diagonalizations are widely used in theoretical physics — and 
also theoretical chemistry [|] — we would be happy to learn about any algorithm 
that does better than CWI for our matrices. We are especially interested in a 
preconditioner for the iterative methods which is suitably adapted to our problem. 
Certainly improved direct methods for our matrix type are also of great importance. 
We hope to have convinced the reader that it may be worthwhile to rethink seem- 
ingly easy problems like the present eigenproblem for real and symmetric matrices. 

Acknowledgments. We thank lain Duff for supplying us with Harwell's 
MA27 routine and Dan Sorensen and Chao Sun for their polynomial convergence 
accelerator. 
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